Group Names: Cameron Cross, Cathy Cao, Berit Waterfield

Background:

Inspiration:

The central question for this project is: Which physicochemical component influences wine quality the most? Each component alters the quality, but one component may stand out and play a heavy role in determining a wine’s quality. Additionally, we would like to see what a wine’s input component values look like at each quality level and whether certain elements of wine have a “relationship” (a tendency to use more or less of one component when a different component is added).

About the data:

Our data relates to the red and white variants of the Portuguese “Vinho Verde” wine. It was collected by a team of scientists who used machine learning to try to predict human wine taste preferences from the contents of the wine. The data set includes physicochemical (input) and sensory (output) variables. The input variables are:
* fixed acidity
* volatile acidity
* citric acid
* residual sugar
* chlorides
* free sulfur dioxide
* total sulfur dioxide
* density
* pH
* sulphates
* alcohol

The output variable is:
* quality - output variable (based on sensory data, score between 0 and 10)

Data modifications:

We first created a new data set, mutated_data, that copies the original data set with the quality values converted to character values. Secondly, we created another data set, wine2, that summarizes the original data set with the mean and standard deviation of each input variable. Lastly, the third data set, wine3, takes the original data set, groups it by quality value, and again summarizes it with the mean and standard deviation of each input component. wine3 is also used to create smaller data sets, filtered by quality level, for later use.
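The three derived data sets can be sketched roughly as follows. This is a minimal base R sketch, not the report's actual code: the toy `wine` data frame, its made-up values, the helper name `inputs`, the `_mean`/`_sd` column suffixes, and the subset name `wine3_q6` are all assumptions for illustration.

```r
# Toy stand-in for the wine data; values are made up for illustration only.
wine <- data.frame(
  alcohol = c(9.4, 9.8, 10.5, 11.2, 12.0, 12.8),
  pH      = c(3.51, 3.20, 3.26, 3.16, 3.30, 3.39),
  quality = c(5, 5, 6, 6, 7, 7)
)

# Every column except the output variable is treated as an input
inputs <- setdiff(names(wine), "quality")

# mutated_data: same rows, with quality stored as character values
mutated_data <- transform(wine, quality = as.character(quality))

# wine2: overall mean and standard deviation of every input variable
wine2 <- data.frame(
  variable  = inputs,
  mean      = sapply(wine[inputs], mean),
  sd        = sapply(wine[inputs], sd),
  row.names = NULL
)

# wine3: mean and standard deviation of every input, per quality level
wine3 <- merge(
  aggregate(wine[inputs], by = list(quality = wine$quality), FUN = mean),
  aggregate(wine[inputs], by = list(quality = wine$quality), FUN = sd),
  by = "quality", suffixes = c("_mean", "_sd")
)

# A smaller data set filtered to one quality level (here, level 6)
wine3_q6 <- wine3[wine3$quality == 6, ]
```

The same grouping and summarizing could equally be written with dplyr verbs; base R is used here only to keep the sketch free of package dependencies.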

Data citation:

UCI Machine Learning. “Red Wine Quality.” Kaggle, 27 Nov. 2017, www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009.

Mission:

We started by examining the output component, quality, using various methods. Then, each input component was examined individually to show its features. A correlation chart was then created to determine whether there are any relationships between the components. Lastly, a linear regression was used to determine the variables that influence the quality of wine the most.

Analysis:

First, we take a look at the output variable, quality

A bar graph is created to show the number of wines that fall within each quality value.
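As a rough illustration of the counting behind such a bar graph, the sketch below tallies wines per quality value in base R. The `quality` vector holds made-up scores, and `quality_counts` is a hypothetical name; the report's own plotting code is not shown here.

```r
# Toy quality scores; values are made up for illustration only.
quality <- c(5, 6, 5, 7, 6, 5, 4, 6, 8, 5)

# Count how many wines fall within each quality value
quality_counts <- table(quality)

# Draw the counts as bars when run in an interactive session
if (interactive()) {
  barplot(quality_counts, xlab = "Quality", ylab = "Number of wines")
}
```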

Next, the data is grouped by quality and the mean of each component is calculated for each quality level. These means are graphed to display the trends that occur with increasing quality.

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Next, each of the input variables is examined in detail

Each input variable is examined in three ways: a count of wines at each amount of usage, a box plot of its values at each quality level, and density distributions separated by quality value. Together these give us a good look at the most common values overall and at each quality level.

Fixed Acidity

Volatile Acidity

Citric Acid

Residual Sugar

Chlorides

Free Sulfur Dioxide

Total Sulfur Dioxide

Density

pH

Sulphates

Alcohol

Thirdly, a correlation chart is created to unveil relationships between various components

The wine data set is used to create a rounded correlation matrix. This is then melted into three columns: Var1 (the first variable of each pair), Var2 (the second variable of each pair, drawn from the same set of variables), and the correlation value. These are then plotted using ggplot.
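The matrix-then-melt step can be sketched in base R as shown below. This is an illustration under assumptions, not the report's code: the toy `wine` values are made up, rounding to two decimals is a guess at the precision used, and `as.data.frame(as.table(...))` stands in for whichever melt function was actually called.

```r
# Toy stand-in for the wine data; values are made up for illustration only.
wine <- data.frame(
  fixed_acidity = c(7.4, 7.8, 11.2, 7.9, 6.7, 8.5),
  pH            = c(3.51, 3.20, 3.16, 3.30, 3.39, 3.28),
  alcohol       = c(9.4, 9.8, 9.8, 10.5, 11.2, 12.0)
)

# Pairwise Pearson correlations, rounded to two decimals (assumed precision)
cor_matrix <- round(cor(wine), 2)

# "Melt" the square matrix into one row per (Var1, Var2) pair
melted_cor <- as.data.frame(as.table(cor_matrix), responseName = "value")
```

Each row of `melted_cor` then describes one cell of the chart: two variable names and their correlation. A square matrix of k variables melts into k × k rows, including the diagonal, where every variable correlates perfectly with itself.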

It can be seen that citric acid and fixed acidity have a relatively strong relationship, as do density and fixed acidity, and total sulfur dioxide and free sulfur dioxide. Also, pH and fixed acidity, pH and citric acid, and alcohol and density have moderately strong inverse relationships. It is also worth noting that alcohol has the strongest relationship with quality, which will be examined further next.

We could have determined the most influential components from their relationship to quality in the correlation chart, but to increase our confidence, we normalize the variables and use a linear regression model.

Lastly, a linear regression model is used with the normalized variable values to determine the most important components in wine (those that alter its quality the most when all components are added in the same quantity)

The data set wine is first normalized using this equation:

\[ x_{norm} = \left(\frac {x - x_{min}}{x_{max} - x_{min}} \right) \]

Afterwards, the normalized values are entered into the linear regression model and summarized

## 
## Call:
## lm(formula = quality ~ ., data = normalized_wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.68911 -0.36652 -0.04699  0.45202  2.02498 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   5.7126     0.1509  37.846  < 2e-16 ***
## normal_fixed_acidity          0.2824     0.2932   0.963   0.3357    
## normal_volatile_acidity      -1.5820     0.1768  -8.948  < 2e-16 ***
## normal_citric_acid           -0.1826     0.1472  -1.240   0.2150    
## normal_residual_sugar         0.2384     0.2190   1.089   0.2765    
## normal_chlorides             -1.1227     0.2512  -4.470 8.37e-06 ***
## normal_free_sulfur_dioxide    0.3097     0.1542   2.009   0.0447 *  
## normal_total_sulfur_dioxide  -0.9239     0.2062  -4.480 8.00e-06 ***
## normal_density               -0.2435     0.2946  -0.827   0.4086    
## normal_pH                    -0.5253     0.2433  -2.159   0.0310 *  
## normal_sulphates              1.5303     0.1909   8.014 2.13e-15 ***
## normal_alcohol                1.7953     0.1721  10.429  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.648 on 1587 degrees of freedom
## Multiple R-squared:  0.3606, Adjusted R-squared:  0.3561 
## F-statistic: 81.35 on 11 and 1587 DF,  p-value: < 2.2e-16
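
The normalization and regression steps above can be sketched as follows. The names `normalized_wine`, the `normal_` prefix, and the formula `quality ~ .` come from the summary output; everything else is an assumption for illustration, including the made-up toy values and the helper name `min_max`. Leaving quality on its original scale is inferred from the intercept and residual ranges in the output, not stated in the report.

```r
# Toy stand-in for the wine data; values are made up for illustration only.
wine <- data.frame(
  alcohol   = c(9.4, 9.8, 10.5, 11.2, 12.0, 12.8, 10.0, 11.5),
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.75, 0.82, 0.47, 0.80),
  quality   = c(5, 5, 6, 5, 7, 7, 5, 6)
)

# Min-max scaling: maps the smallest value to 0 and the largest to 1
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Normalize every input column and give it the normal_ prefix
inputs <- setdiff(names(wine), "quality")
normalized_wine <- as.data.frame(lapply(wine[inputs], min_max))
names(normalized_wine) <- paste0("normal_", inputs)

# The output variable is kept on its original scale (an assumption)
normalized_wine$quality <- wine$quality

# Regress quality on all normalized inputs and summarize the fit
model <- lm(quality ~ ., data = normalized_wine)
summary(model)
```

Because every input then spans the same 0 to 1 range, each slope reads as the change in predicted quality when that input moves from its minimum to its maximum observed value, which is what makes the slopes comparable across components.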

In conclusion, the estimate values are the slopes of the model: the greater the absolute value of a slope, the more an increase or decrease in that variable changes the quality of a wine. The components are ranked below by absolute slope, with the slope values included:


Ranking of most important physicochemical components:

  1. Alcohol (1.7953)
  2. Volatile Acidity (-1.5820)
  3. Sulphates (1.5303)
  4. Chlorides (-1.1227)
  5. Total Sulfur Dioxide (-0.9239)
  6. pH (-0.5253)
  7. Free Sulfur Dioxide (0.3097)
  8. Fixed Acidity (0.2824)
  9. Density (-0.2435)
  10. Residual Sugar (0.2384)
  11. Citric Acid (-0.1826)
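
The ranking above can be reproduced by sorting the slopes on their absolute value. In this sketch the numbers are copied from the Estimate column of the regression summary; the vector names `estimates` and `ranking` are hypothetical.

```r
# Slope estimates copied from the regression summary (intercept excluded)
estimates <- c(
  fixed_acidity        =  0.2824,
  volatile_acidity     = -1.5820,
  citric_acid          = -0.1826,
  residual_sugar       =  0.2384,
  chlorides            = -1.1227,
  free_sulfur_dioxide  =  0.3097,
  total_sulfur_dioxide = -0.9239,
  density              = -0.2435,
  pH                   = -0.5253,
  sulphates            =  1.5303,
  alcohol              =  1.7953
)

# Order by magnitude, largest first; the sign of each slope is preserved
ranking <- estimates[order(abs(estimates), decreasing = TRUE)]
print(ranking)
```

With a fitted model object at hand, the same vector could be taken from `coef()` after dropping the intercept instead of being typed in by hand.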

With these variables ranked, it is clear that alcohol content is crucial to quality, but volatile acidity, sulphates, chlorides, and total sulfur dioxide are also major contributors to a wine’s quality. This means that these components must be considered heavily when making wine in order to achieve a great quality drink.

This raises two last questions: Should the top five physicochemical components be the only ones considered, given the sharp drop-off in their effect on wine quality after total sulfur dioxide? Will increasing or decreasing the amount of an ingredient always alter the quality in the same way?

Interpretation:

Based on our data analysis, alcohol is the most critical physicochemical component when it comes to wine quality. However, the components ranked second through fifth play a critical role in a wine’s quality as well. After the fifth component, the remaining inputs influence wine quality far less, so they can be considered inconsequential by comparison. Each of these variables raises or lowers quality only to a certain extent: adding an excessive amount of alcohol or another component would greatly lower a wine’s quality. Making quality wine is a great balancing game. These physicochemical components alter a wine’s quality when added or removed in small amounts, not in excess, so the rankings should be applied on a small scale.

With help from the correlation chart, it can be seen that some of the components have relationships. Citric acid and fixed acidity, density and fixed acidity, and total sulfur dioxide and free sulfur dioxide have relatively strong relationships, while pH and fixed acidity, pH and citric acid, and alcohol and density have moderately strong inverse relationships.